
Conversation


@Alcpz Alcpz commented Nov 5, 2025

While testing #16739, perplexities for LFM2 skyrocketed. @ggerganov pointed out that some matrix shapes would probably not be supported.

LFM2 has some layers with two batches, so the MUL_MATs were only partially computed, leading to incorrect results. See #16739 (comment)

This patch adds basic support for tensors with ne2 > 1, using very naive chunking based on the non-repack MUL_MAT.
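In rough terms, the change iterates over the extra batch dimension and runs the existing 2-D repack path on each slice, mirroring how the non-repack MUL_MAT broadcasts src0 over src1. A minimal sketch of that idea (not the literal patch; names follow ggml conventions, and repack_gemm_2d is a hypothetical stand-in for the existing per-slice kernel):

// naive batching: run the existing 2-D repack path once per batch slice i12
const int64_t r2 = ne12 / ne02; // how many src1 batches broadcast onto one src0 batch

for (int64_t i12 = 0; i12 < ne12; ++i12) {
    const int64_t i02 = i12 / r2;

    const char * src0_ptr = (const char *) src0->data    + i02*nb02;
    const char * src1_ptr = (const char *) params->wdata + i12*ne11*src1_col_stride; // packed rows of this slice
    char       * dst_ptr  = (char *)       dst->data     + i12*nb2;

    // the per-slice work is then chunked across threads as before
    repack_gemm_2d(params, dst_ptr, src0_ptr, src1_ptr, ne01, ne11);
}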

Perplexities using this patch:

# REPACK ON
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198
# REPACK OFF
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198

I can provide logs for other models if needed.

Repro commands:

# GGML_CPU_REPACK=ON|OFF GGML_BLAS=OFF GGML_METAL=OFF

for model in unsloth/Qwen3-8B-128K-GGUF:Q4_0 LiquidAI/LFM2-1.2B-GGUF:Q4_0 LiquidAI/LFM2-2.6B-GGUF:Q4_0; do
  ./bin/llama-perplexity -hf "$model" -f ./wikitext-2-raw/wiki.test.raw --chunks 100 -dev none
done

Other models:

# Qwen 3 REPACK ON
perplexities_build-cpu-aarm64_Qwen3-8B-128K-GGUF:Q4_0.txt-[1]7.6803,[2]10.1811,[3]9.4260,[4]9.0666,[5]9.2647,[6]9.6980,[7]9.8774,[8]10.4100,[9]10.9424,[10]11.4185,[11]11.4938,[12]11.6893,[13]12.1807,[14]11.7433,[15]11.5808,[16]11.7468,[17]11.0987,[18]11.2603,[19]11.0962,[20]11.1735,[21]10.8974,[22]10.9181,[23]10.4976,[24]9.9920,[25]9.7800,[26]9.5234,[27]9.2917,[28]9.1358,[29]9.1840,[30]9.1386,[31]9.1237,[32]9.1164,[33]8.9839,[34]9.0213,[35]9.0949,[36]9.2154,[37]9.3887,[38]9.4512,[39]9.4129,[40]9.4693,[41]9.4650,[42]9.3915,[43]9.4305,[44]9.4552,[45]9.4605,[46]9.4598,[47]9.6747,[48]9.7829,[49]9.7476,[50]9.8248,[51]9.8489,[52]9.8696,[53]9.9087,[54]9.9802,[55]10.0069,[56]10.0701,[57]10.0272,[58]10.0532,[59]10.1151,[60]10.1645,[61]10.1961,[62]10.2441,[63]10.3282,[64]10.3811,[65]10.4620,[66]10.5540,[67]10.6382,[68]10.6259,[69]10.6246,[70]10.6129,[71]10.6290,[72]10.6951,[73]10.7189,[74]10.7327,[75]10.6625,[76]10.6244,[77]10.6562,[78]10.6942,[79]10.6103,[80]10.5880,[81]10.5408,[82]10.5761,[83]10.5308,[84]10.5104,[85]10.5348,[86]10.6326,[87]10.6827,[88]10.6733,[89]10.6861,[90]10.6783,[91]10.7371,[92]10.6980,[93]10.7394,[94]10.7430,[95]10.7241,[96]10.7199,[97]10.6880,[98]10.6990,[99]10.6692,[100]10.7254,
perplexities_build-cpu-aarm64_Qwen3-8B-128K-GGUF:Q4_0.txt:Final estimate: PPL = 10.7254 +/- 0.20427

# Qwen 3 REPACK OFF
perplexities_build-cpu-aarm64-norepack_Qwen3-8B-128K-GGUF:Q4_0.txt-[1]7.6803,[2]10.1811,[3]9.4260,[4]9.0666,[5]9.2647,[6]9.6980,[7]9.8774,[8]10.4100,[9]10.9424,[10]11.4185,[11]11.4938,[12]11.6893,[13]12.1807,[14]11.7433,[15]11.5808,[16]11.7468,[17]11.0987,[18]11.2603,[19]11.0962,[20]11.1735,[21]10.8974,[22]10.9181,[23]10.4976,[24]9.9920,[25]9.7800,[26]9.5234,[27]9.2917,[28]9.1358,[29]9.1840,[30]9.1386,[31]9.1237,[32]9.1164,[33]8.9839,[34]9.0213,[35]9.0949,[36]9.2154,[37]9.3887,[38]9.4512,[39]9.4129,[40]9.4693,[41]9.4650,[42]9.3915,[43]9.4305,[44]9.4552,[45]9.4605,[46]9.4598,[47]9.6747,[48]9.7829,[49]9.7476,[50]9.8248,[51]9.8489,[52]9.8696,[53]9.9087,[54]9.9802,[55]10.0069,[56]10.0701,[57]10.0272,[58]10.0532,[59]10.1151,[60]10.1645,[61]10.1961,[62]10.2441,[63]10.3282,[64]10.3811,[65]10.4620,[66]10.5540,[67]10.6382,[68]10.6259,[69]10.6246,[70]10.6129,[71]10.6290,[72]10.6951,[73]10.7189,[74]10.7327,[75]10.6625,[76]10.6244,[77]10.6562,[78]10.6942,[79]10.6103,[80]10.5880,[81]10.5408,[82]10.5761,[83]10.5308,[84]10.5104,[85]10.5348,[86]10.6326,[87]10.6827,[88]10.6733,[89]10.6861,[90]10.6783,[91]10.7371,[92]10.6980,[93]10.7394,[94]10.7430,[95]10.7241,[96]10.7199,[97]10.6880,[98]10.6990,[99]10.6692,[100]10.7254,
perplexities_build-cpu-aarm64-norepack_Qwen3-8B-128K-GGUF:Q4_0.txt:Final estimate: PPL = 10.7254 +/- 0.20427

# LFM2 REPACK ON
perplexities_build-cpu-aarm64_LFM2-2.6B-GGUF:Q4_0.txt-[1]7.0724,[2]11.2417,[3]11.3736,[4]11.0566,[5]11.2978,[6]11.8576,[7]12.1547,[8]12.8728,[9]13.8226,[10]14.0957,[11]13.6415,[12]13.7865,[13]14.1242,[14]13.5275,[15]13.2750,[16]13.1469,[17]12.3869,[18]12.5628,[19]12.4196,[20]12.2570,[21]11.8653,[22]11.8657,[23]11.5625,[24]11.3099,[25]11.2837,[26]11.0172,[27]10.9685,[28]10.9421,[29]10.8844,[30]11.0062,[31]10.9984,[32]11.1214,[33]11.0812,[34]11.0926,[35]11.0572,[36]11.1630,[37]11.3042,[38]11.1564,[39]11.3252,[40]11.2555,[41]11.2296,[42]11.2722,[43]11.3182,[44]11.2066,[45]11.2418,[46]11.3877,[47]11.5001,[48]11.4392,[49]11.4613,[50]11.5636,[51]11.5742,[52]11.5927,[53]11.6412,[54]11.6469,[55]11.7139,[56]11.7273,[57]11.7956,[58]11.8651,[59]11.9185,[60]11.9757,[61]11.9816,[62]12.0535,[63]12.1499,[64]12.2589,[65]12.3879,[66]12.4853,[67]12.4684,[68]12.4438,[69]12.4475,[70]12.4592,[71]12.5043,[72]12.5274,[73]12.5598,[74]12.5025,[75]12.4682,[76]12.4976,[77]12.5186,[78]12.4596,[79]12.3959,[80]12.3615,[81]12.4195,[82]12.4745,[83]12.4321,[84]12.4450,[85]12.5002,[86]12.5583,[87]12.5979,[88]12.5772,[89]12.5398,[90]12.5321,[91]12.4828,[92]12.5500,[93]12.5727,[94]12.5613,[95]12.5658,[96]12.5653,[97]12.5379,[98]12.5156,[99]12.5447,[100]12.5589,
perplexities_build-cpu-aarm64_LFM2-2.6B-GGUF:Q4_0.txt:Final estimate: PPL = 12.5589 +/- 0.21849

# LFM2 REPACK OFF
perplexities_build-cpu-aarm64-norepack_LFM2-2.6B-GGUF:Q4_0.txt-[1]7.0724,[2]11.2417,[3]11.3736,[4]11.0566,[5]11.2978,[6]11.8576,[7]12.1547,[8]12.8728,[9]13.8226,[10]14.0957,[11]13.6415,[12]13.7865,[13]14.1242,[14]13.5275,[15]13.2750,[16]13.1469,[17]12.3869,[18]12.5628,[19]12.4196,[20]12.2570,[21]11.8653,[22]11.8657,[23]11.5625,[24]11.3099,[25]11.2837,[26]11.0172,[27]10.9685,[28]10.9421,[29]10.8844,[30]11.0062,[31]10.9984,[32]11.1214,[33]11.0812,[34]11.0926,[35]11.0572,[36]11.1630,[37]11.3042,[38]11.1564,[39]11.3252,[40]11.2555,[41]11.2296,[42]11.2722,[43]11.3182,[44]11.2066,[45]11.2418,[46]11.3877,[47]11.5001,[48]11.4392,[49]11.4613,[50]11.5636,[51]11.5742,[52]11.5927,[53]11.6412,[54]11.6469,[55]11.7139,[56]11.7273,[57]11.7956,[58]11.8651,[59]11.9185,[60]11.9757,[61]11.9816,[62]12.0535,[63]12.1499,[64]12.2589,[65]12.3879,[66]12.4853,[67]12.4684,[68]12.4438,[69]12.4475,[70]12.4592,[71]12.5043,[72]12.5274,[73]12.5598,[74]12.5025,[75]12.4682,[76]12.4976,[77]12.5186,[78]12.4596,[79]12.3959,[80]12.3615,[81]12.4195,[82]12.4745,[83]12.4321,[84]12.4450,[85]12.5002,[86]12.5583,[87]12.5979,[88]12.5772,[89]12.5398,[90]12.5321,[91]12.4828,[92]12.5500,[93]12.5727,[94]12.5613,[95]12.5658,[96]12.5653,[97]12.5379,[98]12.5156,[99]12.5447,[100]12.5589,
perplexities_build-cpu-aarm64-norepack_LFM2-2.6B-GGUF:Q4_0.txt:Final estimate: PPL = 12.5589 +/- 0.21849

@Alcpz Alcpz changed the title ggml-cpu: handle 3d tensors in repack mul_mat ggml-cpu: handle 3d tensors in repack mat_mul Nov 5, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 5, 2025
@Alcpz Alcpz force-pushed the Alcpz/batched_repack_mul_mat branch from eadb483 to 950671d on November 5, 2025 18:03
@Alcpz Alcpz marked this pull request as draft November 5, 2025 18:46
@Alcpz Alcpz marked this pull request as ready for review November 6, 2025 12:24

Alcpz commented Nov 10, 2025

@ggerganov This is ready for review now. Thanks for your patience.

Comment on lines +1629 to +1633

const char * src0_ptr = (const char *) src0->data + i02 * nb02;
const char * src1_ptr = (const char *) params->wdata + (i11 + i12 * ne11) * src1_col_stride;
char * dst_ptr = ((char *) dst->data + (i1 * nb1 + i2 * nb2));

Member

Add GGML_ASSERT here that guarantees we are within bounds of [params->wdata, params->wdata + params->wsize)

Contributor Author

Added one GGML_ASSERT for the upper bound. The lower bound is always satisfied: as long as ne1 and ne11 are >= 1, i11 and i12 are non-negative, so the pointer never falls below params->wdata.
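For illustration, such an upper-bound check could look roughly like this (a sketch; the exact expression in the patch may differ slightly):

// the packed src1 column we are about to read must stay inside the scratch buffer
GGML_ASSERT(src1_ptr + src1_col_stride <= (const char *) params->wdata + params->wsize);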

@Alcpz Alcpz force-pushed the Alcpz/batched_repack_mul_mat branch from 5a202b9 to d1938ad on November 10, 2025 20:26

Alcpz commented Nov 12, 2025

@ggerganov I've addressed all your comments. Let me know if something else is required.

@ggerganov ggerganov merged commit 1c398dc into ggml-org:master Nov 12, 2025
71 checks passed

max-krasnyansky commented Nov 13, 2025

@Alcpz
This PR causes a significant performance regression in prompt processing because it creates many more chunks than before.

Here is llama3.2-1B-Q4_0 running with 6 threads and instrumented matmul code.

The instrumentation simply counts the number of chunks processed and the time per thread:
repack-chunking-inst.diff.txt
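For context, this kind of per-thread instrumentation can look roughly like the following (a hypothetical sketch, not the attached diff; ggml_time_us, params->ith, and dst->name are standard ggml helpers/fields):

// around the per-thread chunk loop of the mul_mat:
const int64_t t_start = ggml_time_us(); // taken before the chunk loop
int nchunks = 0;                        // incremented once per chunk processed
// ... existing chunk loop, with nchunks++ in its body ...
fprintf(stderr, "thread-%d: %s nchunks %d usec %lld\n",
        params->ith, dst->name, nchunks, (long long) (ggml_time_us() - t_start));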

After this PR                                 Before this PR
thread-2: Qcur-11 nchunks 38 usec 1844        thread-4: Qcur-11 nchunks 6 usec 1496
thread-3: Qcur-11 nchunks 38 usec 1844        thread-0: Qcur-11 nchunks 6 usec 1498
thread-4: Qcur-11 nchunks 17 usec 1874        thread-5: Qcur-11 nchunks 3 usec 1597
thread-5: Qcur-11 nchunks 17 usec 1948        thread-1: Qcur-11 nchunks 3 usec 1640
thread-1: Qcur-11 nchunks 17 usec 1894        thread-2: Qcur-11 nchunks 3 usec 1685
thread-0: Qcur-11 nchunks 17 usec 1876        thread-3: Qcur-11 nchunks 3 usec 1718
thread-4: Vcur-11 nchunks 17 usec 607         thread-5: Vcur-11 nchunks 6 usec 508
thread-5: Vcur-11 nchunks 17 usec 638         thread-4: Vcur-11 nchunks 6 usec 515
thread-2: Vcur-11 nchunks 39 usec 617         thread-0: Vcur-11 nchunks 3 usec 547
thread-1: Vcur-11 nchunks 15 usec 618         thread-2: Vcur-11 nchunks 3 usec 548
thread-0: Vcur-11 nchunks 17 usec 630         thread-1: Vcur-11 nchunks 3 usec 564
thread-3: Vcur-11 nchunks 39 usec 617         thread-3: Vcur-11 nchunks 3 usec 596
thread-5: Kcur-11 nchunks 38 usec 611         thread-5: Kcur-11 nchunks 6 usec 484
thread-1: Kcur-11 nchunks 17 usec 615         thread-0: Kcur-11 nchunks 6 usec 490
thread-0: Kcur-11 nchunks 17 usec 617         thread-1: Kcur-11 nchunks 3 usec 547
thread-2: Kcur-11 nchunks 17 usec 628         thread-3: Kcur-11 nchunks 3 usec 548
thread-4: Kcur-11 nchunks 38 usec 611         thread-4: Kcur-11 nchunks 3 usec 557
thread-3: Kcur-11 nchunks 17 usec 649         thread-2: Kcur-11 nchunks 3 usec 547
thread-3: attn_out-11 nchunks 38 usec 1835    thread-4: attn_out-11 nchunks 6 usec 1567
thread-5: attn_out-11 nchunks 38 usec 1847    thread-5: attn_out-11 nchunks 6 usec 1569
thread-0: attn_out-11 nchunks 17 usec 1880    thread-1: attn_out-11 nchunks 3 usec 1637
thread-4: attn_out-11 nchunks 17 usec 1886    thread-2: attn_out-11 nchunks 3 usec 1639
thread-1: attn_out-11 nchunks 17 usec 1890    thread-3: attn_out-11 nchunks 3 usec 1642
thread-2: attn_out-11 nchunks 17 usec 1897    thread-0: attn_out-11 nchunks 3 usec 1649
thread-3: ffn_gate-11 nchunks 38 usec 4886    thread-5: ffn_gate-11 nchunks 6 usec 4103
thread-2: ffn_gate-11 nchunks 38 usec 4887    thread-4: ffn_gate-11 nchunks 6 usec 4141
thread-5: ffn_gate-11 nchunks 17 usec 4992    thread-0: ffn_gate-11 nchunks 3 usec 4298
thread-1: ffn_gate-11 nchunks 17 usec 5010    thread-1: ffn_gate-11 nchunks 3 usec 4357
thread-4: ffn_gate-11 nchunks 17 usec 5010    thread-2: ffn_gate-11 nchunks 3 usec 4373
thread-0: ffn_gate-11 nchunks 17 usec 5032    thread-3: ffn_gate-11 nchunks 3 usec 4447
thread-5: ffn_up-11 nchunks 38 usec 4908      thread-0: ffn_up-11 nchunks 6 usec 4107
thread-3: ffn_up-11 nchunks 38 usec 4909      thread-5: ffn_up-11 nchunks 6 usec 4129
thread-4: ffn_up-11 nchunks 17 usec 5000      thread-1: ffn_up-11 nchunks 3 usec 4362
thread-0: ffn_up-11 nchunks 17 usec 5005      thread-4: ffn_up-11 nchunks 3 usec 4377
thread-1: ffn_up-11 nchunks 17 usec 5008      thread-3: ffn_up-11 nchunks 3 usec 4400
thread-2: ffn_up-11 nchunks 17 usec 5037      thread-2: ffn_up-11 nchunks 3 usec 4381
thread-5: ffn_out-11 nchunks 38 usec 4924     thread-5: ffn_out-11 nchunks 6 usec 4089
thread-4: ffn_out-11 nchunks 38 usec 4928     thread-0: ffn_out-11 nchunks 6 usec 4089
thread-1: ffn_out-11 nchunks 17 usec 5006     thread-3: ffn_out-11 nchunks 3 usec 4386
thread-2: ffn_out-11 nchunks 17 usec 5010     thread-2: ffn_out-11 nchunks 3 usec 4386
thread-0: ffn_out-11 nchunks 17 usec 5011     thread-4: ffn_out-11 nchunks 3 usec 4414
thread-3: ffn_out-11 nchunks 17 usec 5023     thread-1: ffn_out-11 nchunks 3 usec 4391 

That's way too many chunks, and we burn a lot of time on synchronization.
If you have an idea for a quick fix that you can test on LFM2, please start another PR and I'll verify it on my setup.
Make sure to test with Llama3.2 and Qwen3 models with the instrumented code.


Alcpz commented Nov 13, 2025

Mmm. Let's revert this then. I will reopen a PR from the branch as a draft so we can work out a better solution. I'd rather not introduce a regression upstream. @ggerganov Mind doing the revert?
